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Abstract. We present a novel Natural Evolution Strategy (NES) vari- 
ant, the Rank-One NES (Rl-NES), which uses a low rank approxima- 
tion of the search distribution covariance matrix. The algorithm allows 
computation of the natural gradient with cost linear in the dimensional- 
ity of the parameter space, and excels in solving high-dimensional non- 
separable problems, including the best result to date on the Rosenbrock 
function (512 dimensions). 

1 Introduction 

Black-box optimization (also called zero-order optimization) methods have re- 
ceived a great deal of attention in recent years due to their broad applicability to 
real world problems [7,9-11, 15, 18]. When the structure of the objective function 
is unknown or too complex to model directly, or when gradient information is 
unavailable or unreliable, such methods arc often seen as the last resort because 
all they require is that the objective function can be evaluated at specific points. 

In continuous black-box optimization problems, the state-of-the-art algo- 
rithms, such as xNES [4] and CMA-ES [8], are all based the same principle [1,4]: 
a Gaussian search distribution is repeatedly updated based on the objective 
function values of sampled points. Usually the full covariance matrix of the dis- 
tribution is updated, allowing the algorithm to adapt the size and shape of the 
Gaussian to the local characteristics of the objective function. Full parameteriza- 
tion also provides invariance under afRne transformations of the coordinate sys- 
tem, so that ill-shaped, highly non-separable problems can be tackled. However, 
this generality comes with a price. The number of parameters scales quadrati- 
cally in the number of dimensions, and the computational cost per sample is at 
least quadratic in the number of dimensions [13], and sometimes cubic [4,16]. 

This cost is often justified since evaluating the objective function can domi- 
nate the computation, and thus the main focus is on the improvement of sam- 
pling efficiency. However, there are many problems, such as optimizing weights 
in neural networks where the dimensionality of the parameter space can be very 
large (e.g. many thousands of weights), the quadratic cost of updating the search 
distribution can become the computational bottleneck. One possible remedy is 
to restrict the covariance matrix to be diagonal [14], which reduces the compu- 
tation per function evaluation to 0{d), linear in the number d of dimensions. 



Unfortunately, this "diagonal" approach performs poorly when the problem is 
non-separable because the search distribution cannot follow directions that are 
not parallel to the current coordinate axes. 

In this paper, we propose a new variant of the natural evolution strategy 
family [17], termed Rank One NES (Rl-NES). This algorithm stays within the 
general NES framework in that the search distribution is adjusted according to 
the natural gradient [2], but it uses a novel parameterization of the covariance 
matrix, 



where u and a arc the parameters to be adjusted. This parameterization allows 
the predominant eigen direction, u, of C to be aligned in any direction, enabling 
the algorithm to tackle highly non-separable problems while maintaining only 
O (d) parameters. We show through rigorous derivation that the natural gradient 
can also be effectively computed in O (d) per sample. Rl-NES scales well to high 
dimensions, and dramatically outperforms diagonal covariance matrix algorithms 
on non-separable objective functions. As an example, Rl-NES reliably solves the 
the non-convex Rosenbrock function up to 512 dimensions. 

The rest of the paper is organized as follows. Sectio 2, briefly reviews the NES 
framework. The derivation of Rl-NES is presented in Section 3. Section 4 empir- 
ically evaluates the algorithm on standard benchmark functions, and Section 5 
concludes the paper. 



2 The NES framework 

Natural evolution strategics (NES) are a class of evolutionary algorithms for 
real-valued optimization that maintain a search distribution, and adapt the dis- 
tribution parameters by following the natural gradient of the expected function 
value. The success of such algorithms is largely attributed to the use of natural 
gradient, which has the advantage of always pointing in the direction of the steep- 
est ascent, even if the parameter space is not Euclidean. Moreover, compared 
to regular gradient, natural gradient reduces the weights of gradient compo- 
nents with higher uncertainty, therefore making the update more reliable. As a 
consequence, NES algorithms can effectively cope with objective functions with 
ill-shaped landscapes, especially preventing premature convergence on plateaus 
and avoiding overaggressive steps on ridges [16]. 

The general framework of NES is given as follows: At each time step, the 
algorithm samples n e N new samples Xi, . . . , x„ ~ tt (-j^), with tt (•|6') being the 
search distribution parameterized by 9. Let / : M'' i-^ M be the objective function 
to maximize. The expected function value under the search distribution is 



C = a' 



■2 (l + uu^) 




Using the log-likelihood trick, the gradient w.r.t. the parameters can be written 
as 



V0J = Ve j f (x) TT {x\9) dx 



= E[f{x)V0logTr{x\6)]. 
from which we obtain the Monte-Carlo estimate 
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VgJc^ - V/(a;i) Velog7r(a;i|6'). 
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of the sciardi gradient. The key step of NES then consists of replacing this gra- 
dient by the natural gradient 

where 



F = E 



V0log7r(.T,|6l) V0log7r(x,|(?)^ 



is the Fisher information matrix (See Fig.l for an illustration). This leads to a 
straightforward scheme of natural gradient ascent for iteratively updating the 
parameters 

n 

= 6+-Y.f(''i)P~^elog^{x^e) 
1=1 



= 6+-Y.f{xi)Velog7r{xi\e). (1) 
1=1 

The sequence of 1) sampling an offspring population, 2) computing the corre- 
sponding Monte Carlo estimate of the gradient, 3) transforming it into the nat- 
ural gradient, and 4) updating the search distribution, constitutes one iteration 
of NES. 

The most difficult step in NES is the computation of the Fisher information 
matrix with respect to the parameterization. For full Gaussian distribution, the 
Fisher can be derived analytically [4,16]. However, for arbitrary parameterization 
of C, the Fisher matrix can be highly non-trivial. 



3 Natural gradient of the rank-one covciriance matrix 
approximation 



In this paper, we consider a special formulation of the covariance matrix 

C = a^{l + uu^) , 



with parameter set 9 = (cr, u). The special part of the parameterization is the 
vector u € R"^, which corresponds to the predominant direction of C. This allows 
the search distribution to be aligned in any direction by adjusting u, enabling 
the algorithm to follow valleys not aligned with the current coordinate axes, 
which is essential for solving non-separable problems. 

Since a should always be positive, following the same procedure in [4], we 
parameterize cr = e'^, so that A e K can be adjusted freely using gradient descent 
without worrying about a becoming negative. The parameter set is adjusted to 
9 = (A, u) accordingly. 

Prom the derivation of [16] , the natural gradient on the sample mean is given 

by 

V^logp{x\e) = x- fi. (2) 

In the subsequent discussion we always assume /x = for simplicity. It is straight- 
forward to sample from ^^ (0, C)^ by lettting y ^ J\f [0,1), z J\f {0,1), then 

x = a{y + zu)--Af{0,C). 

The inverse of C can also be computed easily as 



c- 



1 + ^2 



7UU 



where = vJu. Using the relation det (/ + uvJ) = 1 + vJu, the determinant 
of Cis 

\C\ = a'''{l + r^). 
Knowing C~ and |C| allows the log-likelihood to be written explicitly as 



logp {x\9) = const — ^ log |C| — ^2;^C x 

= const — Xd — ^ log (l -|- r^) — ^e" 



^^x'^x- 



1 e 



-2A 



2 1 -|-r2 

The regular gradient with respect to A and u can then be computed as: 

^ (x^u)^\ 

X ' X 



Va logp {x\9) = -d + e~^^ ' 



1 + 



(3) 



Vulogp{x\9) = -^-^+e 
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u (x^u) . 
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(l + r2)' ' l + r2 



(4) 



Replacing x with e'^ (y + zu) , then the Fisher can be computed by marginal- 
izing out i.i.d. standard Gaussian variables y and z, namely, 



Vfl logp {x\9) Ve logp {x\9)^ 



Ey^^ \Ie\ogp{v + zu\e)Ve log p{y + zu\9) 



^ For succinctness, we always assume the mean of the search distribution is 0. This 
can be achieved easily by shifting the coordinates. 



Since elements in V log p{x\0)V0 log p{x\9) are essentially polynomials of y 
and z, their expectations can be computed analytically^, which gives the exact 
Fisher information matrix 

2u Q 



with 



B 



l + r2 



l+r 



Let V = u/r, then 



F = 



1 +r2 

The inverse of F is thus given by 
^_ 1 + 



2rf 



2V 
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We apply the formula for block matrix inverse in [12] 
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where Ci = Aw — ^12^22^21, and C2 = A22 — ^21^11^12 are the Schur com- 
plements. Let F be partitioned as above, then 



B- =1- 
and the Shur complements are 



Ci =2d 
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and 



= 2{d-l) 

C2=I + 

= 1 + 
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1 + r2 



rf(l + r2) 



^ The derivation is tedious, thus omitted here. All derivations are numerically verified 
using Monte-Carlo simulation. 



whose inverse is given by 



2 + (r2 - 1) ^ 



2(rf- 1) 

Combining the results gives the analytical solution of the inverse Fisher: 



2r2 {d - 1) 
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2{d-l) 7 + [2 + d (r^ - l)] vv'^ 



Multiplying F with the regular gradient in Eq.3 and Eq.4 gives the natural 
gradient for A and u: 



vxiogp{x\e) 



1 



2{d-l) 



(5) 



and 

Vuiogp{x\9) 



-2A 



2(d- l)r 



(1 - d) (x^v)^ + (r^ + 1) ((x^t;)' - x^x)] . (6) 



Note that computing both ^\\ogp{x\9) and V„logp(a;|0) requires only the 
inner products x^a; and x'^'v, therefore can be done O {d) storage and time. 



3.1 Reparameterization 

The natural gradient above is obtained with respect to u. However, direct gradi- 
ent update on u has an unpleasant property when V„ logp {x\9) is in the opposite 
direction of u, which is illustrated in Fig. 2(a). In this case, the gradient tends 
to shrink u. However, if V„ logp (a;|6') is large, adding the gradient will flip the 
direction of u, and the length of u might even grow. This causes numerical prob- 
lems, especially when r is small. A remedy is to separate the length and direction 
of u, namely, reparameterize u = e'^v, where \\v\\ = 1 and e'^ is the length of u. 
Then the gradient update on c will never flip u, and thus avoid the problem. 
Note that for small change 6u, the update on c and v can be obtained from 

5c = - log {u + Suf^ {u + Su) — c 
~ ^ log (u^u + 2dvJuj — c 
= -logu u+-\og[l + -^)-c 

vJu 



Algorithm 1: Rl-NES(n) 



1 while not terminate do 

2 for i = 1 to n do 

3 yi^M{Q,I) 

4 zi <f-A^(0, 1) 

5 a::^ ^/(yi + Zi'u) //generate sample 

6 fitness[i] <— /(/X + Xj) 

7 end 

8 Compute the natural gradient for /j, A, u, c, and v according to Eq.2, 5, 6, 
7, and 8, and combine them using Eq.l 

9 n ^/i + r/V,,J 

10 A <— A + rjVxJ 

11 if Vc logp {x\6) < then 

12 c-*— c + r?VcJ 

14 « <-e''t; 

15 else 

16 M + r?V„J //additive update 

17 c log 

19 end 

20 end 



and 



dv 



u + du 



u + 6u 



u + Su 



- SvJu 
OU =i= — u 



The natural gradient on c and v is given by letting 5u oc V„ logp {x\9), thanks 
to the invariance property: 



Vclogp(a:|6') = r ^V„ logp (a;|6')^ w 

V„ logp {x\6) = V„ logp {x\9) - logp {x\6)'^ v 



(7) 
(8) 



Note that computing Vg logp {x\9) and logp (a;|^) involves only inner products 
between vectors, which can also be done linearly in the number of dimensions. 

Using the parameterization (c, v) introduces another problem. When r is 
small, Vciogp{x\9) tends to be large, and thus directly updating c causes r to 
grow exponentially, resulting in numerical instability, as shown in Fig. 2(b). In 
this case, the additive update on u, rather than the update on (c, v) is more sta- 
ble. In our implementation, the additive update on u is used if Vg logp {x\9) > 0, 
otherwise the update is on (c, v) . This solution proved to be numerially stable in 
all our tests. Algorithm 1 shows the complete Rl-NES algorithm in pseudocode. 



4 Experiments 

The Rl-NES algorithm was evaluated on the twelve noise-free unimodal func- 
tions [6] in the 'Black-Box Optimization Benchmarking' collection (BBOB) from 
the 2010 GECCO Workshop for Real-Parameter Optimization. In order to make 
the results comparable those of other methods, the setup in [5] was used, which 
transforms the pure benchmark functions to make the parameters non-separable 
(for some) and avoid trivial optima at the origin. 

Rl-NES was compared to xNES [3], SNES [14] on each benchmark with 
problem dimensions d = 2*^, k = {1..9} (20 runs for each setup), except for 
xNES, which was only run up fc = 6, li = 64. Note that xNES serves as a 
proper baseline since it is state-of-the-art, achieving performance on par with 
the popular CMA-ES. The reference machine is an Intel Core 17 processor with 
1.6GHz and 4GB of RAM. 

Fig. 3 shows the results for the eight benchmarks on which Rl-NES performs 
at least as well as the other methods, and often much better. For dimensional- 
ity under 64, Rl-NES is comparable to xNES in terms of the number of fitness 
evaluations, indicating that the rank-one parameterization of the search distri- 
bution effectively captures the local curvature of the fitness function (see Fig. 5 
for example). However, the time required to compute the update for the two 
algorithms differs drastically, as depicted in Fig. 4. For example, a typical run 
of xNES in 64 dimensions takes hours (hence the truncated xNES curves in all 
graphs), compared to minutes for Rl-NES. As a result, Rl-NES can solve these 
problems up to 512 dimensions in acceptable time. In particular, the result on 
the 512-dimcnsional Rosenbrock function is, to our knowledge, the best to date. 
We estimate that optimizing the 512-dimensional sphere function with xNES (or 
any other full parameterization method, e.g. CMA-ES) would take over a year in 
computation time on the same reference hardware. It is also worth pointing out 
that sNES, though sharing similar, low complexity per function evaluation, can 
only solve separable problems (Sphere, Linear, AttractiveSector, and Ellipsoid). 

Fig. 6 shows four cases (Ellipsoid, StepEUipsoid, RotatcdEllipsoid, and Tablet) 
for which Rl-NES is not suited, highlighting a limitation of the algorithm. Three 
of the four functions are from the Ellipsoid family, where the fitness functions 



are variants of the type 



d ^^^^ ^ 

f{xi,...,Xd) = ^Xi 
i=l 

The eigenvahics of the Hessian span several orders of magnitude, and the param- 
eterization with a single predominant direction is not enough to approximate the 
Hessian, resulting in poor performance. The other function where Rl-NES fails 
is the Tablet function where all but a one eigendirection has a large eigenvalue. 
Since the parameterization of Rl-NES only allows a single direction to have a 
large eigenvalue, the shape of the Hessian cannot be effectively approximated. 

5 Conclusion and future work 

We presented a new black-box optimization algorithm Rl-NES that employs a 
novel parameterization of the search distribution covariance matrix which allows 
a predominant search direction to be adjusted using the natiiral gradient with 
complexity linear in the dimensionality. The algorithm shows excellent perfor- 
mance in a number of high- dimensional non-separable problems that, to date, 
have not been solved with other parameterizations of similar complexity. 

Future work will concentrate on overcoming the limitations of the algorithm 
(shown in Fig 6). In particular, we intend to extend the algorithm to a) incor- 
porate multiple search directions, and b) enable each search direction to shrink 
as well as grow. 
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Fig. 1. Plain versus natural gradient in parameter space. Consider two pa- 
rameters, e.g. 9 = (/i,(T), of the search distribution. On the left, the solid (black) 
arrows indicate the gradient samples Ve log 7r(a; | the dotted (blue) arrows corre- 
spond to fix) ■ Vg log7r(a; | 9), that is, the same gradient estimates, but scaled with fit- 
ness. Combining these, the bold (green) arrow indicates the (sampled) fitness gradient 
Ve J, while the bold dashed (red) arrow indicates the corresponding natural gradient 
Ve J = F~V eJ . Being random variables with expectation zero, the distribution of the 
black arrows is governed by their covariance (the gray ellipse). Note that this covari- 
ance is a quantity in parameter space (where the 9 reside) ; not to be confused with that 
of the search space (where the samples x reside). In contrast, on the right, the solid 
(black) arrows represent Ve log7r(a; | 9), and dotted (blue) arrows indicate the natural 
gradient samples f(x) ■ Vg log7r(a; | 9), resulting in the natural gradient (dashed red). 
The covariance of the solid arrows on the right hand side turns out to be the inverse 
of the covariance of the solid arrows on the left. This has the effect that when comput- 
ing the natural gradient, directions with high variance (uncertainty) are penalized and 
thus shrunken, while components with low variance (high certainty) are boosted, since 
these components of the gradient samples deserve more trust. This makes the (dashed 
red) natural gradient a much more trustworthy update direction than the (green) plain 
gradient. 



(a) 



(b) 



Fig. 2. Illustration of the change in parameterization. In both panels, the black 
lines and ellipses refer to the current predominant direction u, and the corresponding 
search distribution from which samples are drawn. The black cross denotes one such 
sample that is being used to update the distribution. In the left panel, the direction of 
the selected point is almost perpendicular to u, resulting in a large gradient, reducing 
u (the dotted blue line). However, direct gradient update on u will flip the direction 
of u. As a result, u stays in the same undesired direction, but with increased length. 
In contrast, performing update on c and v gives the predominant search direction 
depicted in the red, with u shrunk properly. The right panel shows another case where 
the selected point aligns with the search direction, and performing the exponential 
update on c and v causes u to increase dramatically (green line & ellipsoid). This effect 
is prevented by performing the additive update (Eq. 6) on u (red line & ellipsoid). 
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Fig. 3. Performance comparison on BBOB unimodal benchmarks. Log-log 
plot of the median number of fitness evaluations (over 20 trials) required to reach the 
target fitness value of — 10~* for unimodal benchmark functions for which Rl-NES is 
well suited, on dimensions 2 to 512 (cases for which 90% or more of the runs converged 
prematurely are not shown). Note that xNES consistently solves all benchmarks on 
small dimensions (< 64), with a scaling factor that is almost the same over all functions. 
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Fig. 4. Computation time per function evaluation, for the three algorithms, on 
problem dimensions ranging from 2 to 512. Both SNES and Rl-NES scale linearly, 
whereas the cost grows cubically for xNES. 




Fig. 5. Behavior of Rl-NES on the 32-dimensional cigar function: 

/ (xi, . . . ,Xd) = l(f'x\ + a;2 + • • ■ + x\. The left panel shows the best fitness found 
so far, and the min and max fitness in the current population. The right panel shows 
how A and c evolve over time. Note that the A decreases almost linearly, indicating that 
all the other directions except the predominant one shrink exponentially. In contrast, c 
first increases, and then stabilizes around log 1000 (the black line). As a result, I-\-wtJ 
corresponds to the Hessian of the cigar function [lO**, 1, . . . , l] . 
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Fig. 6. Performance comparison on BBOB unimodal benchmarks for which 
Rl-NES is not well suited. For these four functions a single eigendirection is not 
enough. Not that SNES solves the Ellipsoid function because it is separable. 



